On the linearity of large non-linear models: when and why the tangent kernel is constant

Neural Information Processing Systems

The goal of this work is to shed light on the remarkable phenomenon of "transition to linearity" of certain neural networks as their width approaches infinity. We show that the "transition to linearity" of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian scaling, applicable to the standard classes of neural networks. Our analysis provides a new perspective on the phenomenon of the constant tangent kernel, which is different from the widely accepted "lazy training". Furthermore, we show that the transition to linearity is not a general property of wide neural networks and does not hold when the last layer of the network is non-linear. It is also not necessary for successful optimization by gradient descent.
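The abstract's central claim, that the tangent kernel of a wide network barely moves under a parameter step because the Hessian norm shrinks with width, can be checked numerically. The sketch below is my own construction, not code from the paper: a two-layer network f(x) = (1/√m) Σ_k v_k tanh(w_k x) whose tangent kernel K(x, x') = ⟨∇f(x), ∇f(x')⟩ is compared before and after a fixed-size parameter step, at small and large width m.

```python
# Minimal numerical sketch (my own construction, not the paper's code):
# measure how much the tangent kernel K(x, x') = <grad f(x), grad f(x')>
# of a width-m two-layer network moves under a parameter step of O(1)
# Euclidean norm. The shift should shrink as m grows -- the
# "transition to linearity" described in the abstract.
import numpy as np

def tangent_kernel(w, v, x, xp, m):
    # f(x) = (1/sqrt(m)) * sum_k v_k * tanh(w_k * x)
    def grad(z):
        h = np.tanh(w * z)
        dw = v * (1.0 - h**2) * z / np.sqrt(m)  # df/dw_k
        dv = h / np.sqrt(m)                     # df/dv_k
        return np.concatenate([dw, dv])
    return grad(x) @ grad(xp)

def mean_kernel_shift(m, trials=20, seed=0):
    rng = np.random.default_rng(seed)
    x, xp = 0.7, -0.3
    shifts = []
    for _ in range(trials):
        w, v = rng.standard_normal(m), rng.standard_normal(m)
        # parameter step with O(1) Euclidean norm, like a gradient-descent step
        dw = rng.standard_normal(m) / np.sqrt(m)
        dv = rng.standard_normal(m) / np.sqrt(m)
        k0 = tangent_kernel(w, v, x, xp, m)
        k1 = tangent_kernel(w + dw, v + dv, x, xp, m)
        shifts.append(abs(k1 - k0))
    return float(np.mean(shifts))

narrow = mean_kernel_shift(10)    # small width: kernel moves noticeably
wide = mean_kernel_shift(2000)    # large width: kernel nearly constant
print(f"width 10: {narrow:.4f}   width 2000: {wide:.4f}")
```

The widths, inputs, and step size here are arbitrary choices for illustration; the qualitative trend (shift decreasing with width) is what mirrors the Hessian-scaling argument.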




Review for NeurIPS paper: On the linearity of large non-linear models: when and why the tangent kernel is constant

Neural Information Processing Systems

Additional Feedback: [Post Author Response] I thank the authors for responding to my concerns and questions, which made me appreciate the paper better. As the authors clarified, there will be no issues with dual submission. I think this is a good submission that will be of general interest to the NeurIPS community, and I suggest accepting it. Regarding softmax, I agree with the authors that the current analysis holds when the output is a softmax. It would be interesting to see what happens with the softmax nonlinearities that appear in the self-attention layers of Transformer architectures.


Review for NeurIPS paper: On the linearity of large non-linear models: when and why the tangent kernel is constant

Neural Information Processing Systems

This paper clarifies the conditions under which the NTK remains constant. First, it points out that the NTK is constant if and only if the model is linear. Second, it shows that the NTK is almost constant if the spectral norm of the Hessian is small. The Hessian norm is bounded under certain conditions: linearity of the output layer, sparse dependence of the activation function, and the absence of bottleneck layers. Overall, this paper is well written.


Review for NeurIPS paper: Invertible Gaussian Reparameterization: Revisiting the Gumbel-Softmax

Neural Information Processing Systems

This paper presents a simple alternative to the Gumbel-Softmax based on Gaussians and invertible transformations to the hypersimplex. As one reviewer noted, "the proposed approach is simple, has nice properties, and extensible". Many reviewers criticized the lack of experiments on non-linear models in the main text. Some reviewers felt that the clarity of the draft, in particular the motivation, could be improved. This was a borderline paper; however, I recommend acceptance.
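For context on the relaxation idea being revisited, here is a hedged sketch. It is NOT the paper's invertible Gaussian transform to the hypersimplex; it only contrasts the standard Gumbel-Softmax sample, softmax((logits + Gumbel noise)/τ), with a naive variant that swaps in Gaussian noise, to show that any reparameterizable noise yields a differentiable sample on the simplex. All function names here are my own.

```python
# Sketch of the relaxation idea only -- NOT the paper's invertible Gaussian
# reparameterization. Gumbel-Softmax draws softmax((logits + Gumbel)/tau);
# the Gaussian variant simply swaps the noise distribution.
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def gumbel_softmax_sample(logits, tau, rng):
    # Gumbel(0, 1) noise via inverse transform of two logs
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return softmax((logits + g) / tau)

def gaussian_softmax_sample(logits, tau, rng):
    # hypothetical Gaussian-noise variant for illustration
    eps = rng.standard_normal(logits.shape)
    return softmax((logits + eps) / tau)

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -1.0])
s_gumbel = gumbel_softmax_sample(logits, tau=0.5, rng=rng)
s_gauss = gaussian_softmax_sample(logits, tau=0.5, rng=rng)
```

Both samples live on the probability simplex and are differentiable in the logits, which is the property the reparameterization trick needs; the paper's contribution is a principled, invertible Gaussian construction rather than this ad-hoc swap.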


Review for NeurIPS paper: Triple descent and the two kinds of overfitting: where & why do they appear?

Neural Information Processing Systems

The reviewers unanimously appreciated the conceptual novelty of the paper, in which the authors separate the two potential phenomena causing non-monotonic test-error behavior as a function of the number of samples. This is very relevant work for the conference, and the reviewers have accordingly provided extensive feedback; I urge the authors to take the detailed feedback into account in their revision. Additionally, below is the anonymized transcript of some interesting discussion points which I believe highlight some confusions in the paper, and I strongly encourage the authors to address them. Most importantly, please address, with a mathematical proof or extensive empirical evidence, the following concern raised by R1 regarding one of the main claims in the paper: the claim that the linear peak is exhibited only in the presence of noise is not justified in the paper (the authors cite [6], but [6] covers only linear models). I believe that with non-linear RF models there might still be variance terms from initialization and training data; in other words, it is not clear whether the total variance can exhibit a linear peak even when SNR → ∞ (no noise).
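The "linear peak" under discussion is the test-error spike of linear regression near the interpolation threshold n = d. A hedged numerical sketch of the noise-driven version of that peak (my own setup and parameters, not the paper's experiments): minimum-norm least squares on noisy linear data blows up when the number of samples equals the number of features, and recovers when the problem is well oversampled.

```python
# Hedged illustration of the noise-driven "linear peak": minimum-norm least
# squares on noisy linear data spikes near the interpolation threshold n = d.
# Setup, dimensions, and noise level are my own choices, not the paper's.
import numpy as np

def median_test_error(n, d=20, noise=0.5, trials=100, seed=0):
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        beta = rng.standard_normal(d) / np.sqrt(d)      # true coefficients
        X = rng.standard_normal((n, d))
        y = X @ beta + noise * rng.standard_normal(n)
        # minimum-norm least-squares solution
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        X_test = rng.standard_normal((200, d))
        errs.append(np.mean((X_test @ (beta_hat - beta)) ** 2))
    # median rather than mean: the error at n = d has very heavy tails
    return float(np.median(errs))

at_peak = median_test_error(20)       # n = d: interpolation threshold
oversampled = median_test_error(200)  # n >> d: well-conditioned regime
print(f"n = d:  {at_peak:.3f}   n = 10d: {oversampled:.3f}")
```

R1's question is precisely whether an analogous peak can survive when `noise` is set to zero in non-linear random-feature models, where variance from initialization and data sampling remains.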



The Effect of Surprisal on Reading Times in Information Seeking and Repeated Reading

Klein, Keren Gruteke, Meiri, Yoav, Shubi, Omer, Berzak, Yevgeni

arXiv.org Artificial Intelligence

The effect of surprisal on processing difficulty has been a central topic of investigation in psycholinguistics. Here, we use eye-tracking data to examine three language processing regimes that are common in daily life but have not been addressed with respect to this question: information seeking, repeated processing, and the combination of the two. Using standard regime-agnostic surprisal estimates, we find that the prediction of surprisal theory regarding a linear effect of surprisal on processing times extends to these regimes. However, when using surprisal estimates from regime-specific contexts that match the contexts and tasks given to humans, we find that in information seeking, such estimates do not improve the predictive power over processing times compared to standard surprisals. Further, regime-specific contexts yield near-zero surprisal estimates with no predictive power for processing times in repeated reading. These findings point to misalignments of task and memory representations between humans and current language models, and question the extent to which such models can be used for estimating cognitively relevant quantities. We further discuss theoretical challenges posed by these results.
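The quantities in this abstract are simple to state: a word's surprisal is −log₂ p(word | context), and surprisal theory predicts reading times that increase linearly in surprisal. The sketch below uses synthetic data (no real eye-tracking measurements or language-model probabilities) to show the surprisal computation and the ordinary-least-squares fit of the linear effect; all values are invented for illustration.

```python
# Sketch of the quantities in the abstract, on synthetic data: surprisal is
# -log2 p(word | context), and the linear effect of surprisal on reading time
# is estimated with ordinary least squares. Probabilities and reading times
# here are simulated, not real eye-tracking or LM data.
import numpy as np

def surprisal(prob):
    # surprisal in bits of a word with conditional probability `prob`
    return -np.log2(prob)

rng = np.random.default_rng(0)
# hypothetical per-word LM probabilities for 200 words
probs = rng.uniform(0.01, 0.9, size=200)
s = surprisal(probs)
# simulated reading times (ms): true linear effect of 25 ms/bit plus noise
reading_times = 200.0 + 25.0 * s + rng.normal(0.0, 20.0, size=200)

# OLS fit of reading_time ~ a + b * surprisal
b, a = np.polyfit(s, reading_times, 1)
print(f"estimated slope: {b:.1f} ms per bit of surprisal")
```

The regime-specific failure described in the abstract corresponds to cases where `probs` collapses toward 1 (near-zero surprisal, as in repeated reading), leaving the predictor with no variance to explain.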